Machine Learning: AllLife Bank Personal Loan Campaign¶

Problem Statement¶

Context¶

AllLife Bank is a US bank that has a growing customer base. The majority of these customers are liability customers (depositors) with varying sizes of deposits. The number of customers who are also borrowers (asset customers) is quite small, and the bank is interested in expanding this base rapidly to bring in more loan business and in the process, earn more through the interest on loans. In particular, the management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors).

A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9%. This has encouraged the retail marketing department to devise campaigns with better-targeted marketing to increase the success ratio.

As a data scientist at AllLife Bank, you have to build a model that will help the marketing department identify the potential customers who have a higher probability of purchasing the loan.

Objective¶

To predict whether a liability customer will buy a personal loan, to understand which customer attributes are most significant in driving purchases, and to identify which segment of customers to target.

Data Dictionary¶

  • ID: Customer ID
  • Age: Customer’s age in completed years
  • Experience: Years of professional experience
  • Income: Annual income of the customer (in thousand dollars)
  • ZIPCode: Home address ZIP code
  • Family: Family size of the customer
  • CCAvg: Average spending on credit cards per month (in thousand dollars)
  • Education: Education level. 1: Undergrad; 2: Graduate; 3: Advanced/Professional
  • Mortgage: Value of house mortgage, if any (in thousand dollars)
  • Personal_Loan: Did this customer accept the personal loan offered in the last campaign? (0: No, 1: Yes)
  • Securities_Account: Does the customer have a securities account with the bank? (0: No, 1: Yes)
  • CD_Account: Does the customer have a certificate of deposit (CD) account with the bank? (0: No, 1: Yes)
  • Online: Does the customer use internet banking facilities? (0: No, 1: Yes)
  • CreditCard: Does the customer use a credit card issued by any other bank (excluding AllLife Bank)? (0: No, 1: Yes)

Importing necessary libraries¶

In [ ]:
# Installing the libraries with the specified version.
# !pip install numpy==1.25.2 pandas==1.5.3 matplotlib==3.7.1 seaborn==0.13.1 scikit-learn==1.2.2 sklearn-pandas==2.2.0 -q --user

Note: After running the above cell, kindly restart the notebook kernel and run all cells sequentially from the start again.

In [ ]:
# import libraries for data manipulation
import numpy as np
import pandas as pd

# import libraries for data visualization
import seaborn as sns
import matplotlib.pyplot as plt

# Library to split data
from sklearn.model_selection import train_test_split

# To build model for prediction
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

# To tune different models
from sklearn.model_selection import GridSearchCV

# To get different metric scores
from sklearn.metrics import (
  f1_score,
  accuracy_score,
  recall_score,
  precision_score,
  confusion_matrix,
  make_scorer,
)

Loading the dataset¶

In [ ]:
# Mount the drive for Google Colab
from google.colab import drive
drive.mount('/content/drive/')
Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).
In [ ]:
# Read the Loan Modelling csv file
df = pd.read_csv('/content/drive/MyDrive/AIML_LoanCampaign/Loan_Modelling.csv')

Data Overview¶

View the first and last 5 rows of the dataset.¶

In [ ]:
# Returns first 5 rows of the dataframe
df.head()
Out[ ]:
ID Age Experience Income ZIPCode Family CCAvg Education Mortgage Personal_Loan Securities_Account CD_Account Online CreditCard
0 1 25 1 49 91107 4 1.6 1 0 0 1 0 0 0
1 2 45 19 34 90089 3 1.5 1 0 0 1 0 0 0
2 3 39 15 11 94720 1 1.0 1 0 0 0 0 0 0
3 4 35 9 100 94112 1 2.7 2 0 0 0 0 0 0
4 5 35 8 45 91330 4 1.0 2 0 0 0 0 0 1
In [ ]:
# Returns last 5 rows of the dataframe
df.tail()
Out[ ]:
ID Age Experience Income ZIPCode Family CCAvg Education Mortgage Personal_Loan Securities_Account CD_Account Online CreditCard
4995 4996 29 3 40 92697 1 1.9 3 0 0 0 0 1 0
4996 4997 30 4 15 92037 4 0.4 1 85 0 0 0 1 0
4997 4998 63 39 24 93023 2 0.3 3 0 0 0 0 0 0
4998 4999 65 40 49 90034 3 0.5 2 0 0 0 0 1 0
4999 5000 28 4 83 92612 3 0.8 1 0 0 0 0 1 1

Understand the shape of the dataset.¶

In [ ]:
df.shape #returns dimension of the dataframe
Out[ ]:
(5000, 14)

Check the data types of the columns for the dataset.¶

In [ ]:
df.info()  # returns the summary of dataframe including the index dtype and columns, non-null values and memory usage.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   ID                  5000 non-null   int64  
 1   Age                 5000 non-null   int64  
 2   Experience          5000 non-null   int64  
 3   Income              5000 non-null   int64  
 4   ZIPCode             5000 non-null   int64  
 5   Family              5000 non-null   int64  
 6   CCAvg               5000 non-null   float64
 7   Education           5000 non-null   int64  
 8   Mortgage            5000 non-null   int64  
 9   Personal_Loan       5000 non-null   int64  
 10  Securities_Account  5000 non-null   int64  
 11  CD_Account          5000 non-null   int64  
 12  Online              5000 non-null   int64  
 13  CreditCard          5000 non-null   int64  
dtypes: float64(1), int64(13)
memory usage: 547.0 KB

Checking for missing values¶

In [ ]:
df.isna().values.any()     # Checks if there is any null value in any column
Out[ ]:
False

Statistical summary of the data¶

In [ ]:
df.describe().T  # returns stats for all numerical columns
Out[ ]:
count mean std min 25% 50% 75% max
ID 5000.0 2500.500000 1443.520003 1.0 1250.75 2500.5 3750.25 5000.0
Age 5000.0 45.338400 11.463166 23.0 35.00 45.0 55.00 67.0
Experience 5000.0 20.104600 11.467954 -3.0 10.00 20.0 30.00 43.0
Income 5000.0 73.774200 46.033729 8.0 39.00 64.0 98.00 224.0
ZIPCode 5000.0 93169.257000 1759.455086 90005.0 91911.00 93437.0 94608.00 96651.0
Family 5000.0 2.396400 1.147663 1.0 1.00 2.0 3.00 4.0
CCAvg 5000.0 1.937938 1.747659 0.0 0.70 1.5 2.50 10.0
Education 5000.0 1.881000 0.839869 1.0 1.00 2.0 3.00 3.0
Mortgage 5000.0 56.498800 101.713802 0.0 0.00 0.0 101.00 635.0
Personal_Loan 5000.0 0.096000 0.294621 0.0 0.00 0.0 0.00 1.0
Securities_Account 5000.0 0.104400 0.305809 0.0 0.00 0.0 0.00 1.0
CD_Account 5000.0 0.060400 0.238250 0.0 0.00 0.0 0.00 1.0
Online 5000.0 0.596800 0.490589 0.0 0.00 1.0 1.00 1.0
CreditCard 5000.0 0.294000 0.455637 0.0 0.00 0.0 1.00 1.0

Check for Duplicate Values¶

In [ ]:
# Check for duplicate rows; if there are any, we will remove them
df[df.duplicated()].count() # counts duplicated rows per column (all zeros means no duplicates)
Out[ ]:
ID                    0
Age                   0
Experience            0
Income                0
ZIPCode               0
Family                0
CCAvg                 0
Education             0
Mortgage              0
Personal_Loan         0
Securities_Account    0
CD_Account            0
Online                0
CreditCard            0
dtype: int64
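The per-column count above works, but a more direct check (a sketch, not part of the original notebook) is to sum the boolean mask returned by `duplicated()`:

```python
import pandas as pd

# Toy frame with one fully duplicated row (hypothetical values)
toy = pd.DataFrame({"a": [1, 1, 2], "b": [3, 3, 4]})

# Count of fully duplicated rows; this is 0 on the loan data
n_dupes = int(toy.duplicated().sum())
print(n_dupes)  # 1 duplicate in this toy frame
```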

Check for Unique Values¶

In [ ]:
df.nunique() # returns unique values for each column
Out[ ]:
ID                    5000
Age                     45
Experience              47
Income                 162
ZIPCode                467
Family                   4
CCAvg                  108
Education                3
Mortgage               347
Personal_Loan            2
Securities_Account       2
CD_Account               2
Online                   2
CreditCard               2
dtype: int64

Observations - Data Overview¶

There are 5000 rows and 14 columns in the given Loan Campaign dataframe

All datatypes in this Loan dataset are numeric; 13 columns are integers and only CCAvg is a float

The 14 columns comprise 13 features and 1 target (Personal_Loan)

There are no missing (null) values and no duplicate values

Minimum value for Experience is negative (-3), which is not possible and needs treatment

Though ZIPCode is a numeric column, it is really categorical and may need treatment

Total memory used = 547.0 KB
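One common treatment for a categorical ZIP code (a hedged sketch; the notebook does not apply it here) is to coarsen it to its first two digits, a rough geographic region, before any encoding:

```python
import pandas as pd

# Hypothetical sample of ZIP codes in the dataset's format
toy = pd.DataFrame({"ZIPCode": [91107, 90089, 94720]})

# Keep the first two digits as a coarse region and mark it categorical
toy["ZIP_region"] = toy["ZIPCode"].astype(str).str[:2].astype("category")
print(toy["ZIP_region"].tolist())  # ['91', '90', '94']
```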

Checking for Unusual values¶

In [ ]:
df["Age"].unique() # returns unique Age of customers
Out[ ]:
array([25, 45, 39, 35, 37, 53, 50, 34, 65, 29, 48, 59, 67, 60, 38, 42, 46,
       55, 56, 57, 44, 36, 43, 40, 30, 31, 51, 32, 61, 41, 28, 49, 47, 62,
       58, 54, 33, 27, 66, 24, 52, 26, 64, 63, 23])
In [ ]:
df["Experience"].unique() # returns unique Experience for customers
Out[ ]:
array([ 1, 19, 15,  9,  8, 13, 27, 24, 10, 39,  5, 23, 32, 41, 30, 14, 18,
       21, 28, 31, 11, 16, 20, 35,  6, 25,  7, 12, 26, 37, 17,  2, 36, 29,
        3, 22, -1, 34,  0, 38, 40, 33,  4, -2, 42, -3, 43])
In [ ]:
df["Income"].unique() # returns unique Income of customers
Out[ ]:
array([ 49,  34,  11, 100,  45,  29,  72,  22,  81, 180, 105, 114,  40,
       112, 130, 193,  21,  25,  63,  62,  43, 152,  83, 158,  48, 119,
        35,  41,  18,  50, 121,  71, 141,  80,  84,  60, 132, 104,  52,
       194,   8, 131, 190,  44, 139,  93, 188,  39, 125,  32,  20, 115,
        69,  85, 135,  12, 133,  19,  82, 109,  42,  78,  51, 113, 118,
        64, 161,  94,  15,  74,  30,  38,   9,  92,  61,  73,  70, 149,
        98, 128,  31,  58,  54, 124, 163,  24,  79, 134,  23,  13, 138,
       171, 168,  65,  10, 148, 159, 169, 144, 165,  59,  68,  91, 172,
        55, 155,  53,  89,  28,  75, 170, 120,  99, 111,  33, 129, 122,
       150, 195, 110, 101, 191, 140, 153, 173, 174,  90, 179, 145, 200,
       183, 182,  88, 160, 205, 164,  14, 175, 103, 108, 185, 204, 154,
       102, 192, 202, 162, 142,  95, 184, 181, 143, 123, 178, 198, 201,
       203, 189, 151, 199, 224, 218])
In [ ]:
df["ZIPCode"].unique() # returns unique Zipcode of customers
Out[ ]:
array([91107, 90089, 94720, 94112, 91330, 92121, 91711, 93943, 93023,
       94710, 90277, 93106, 94920, 91741, 95054, 95010, 94305, 91604,
       94015, 90095, 91320, 95521, 95064, 90064, 94539, 94104, 94117,
       94801, 94035, 92647, 95814, 94114, 94115, 92672, 94122, 90019,
       95616, 94065, 95014, 91380, 95747, 92373, 92093, 94005, 90245,
       95819, 94022, 90404, 93407, 94523, 90024, 91360, 95670, 95123,
       90045, 91335, 93907, 92007, 94606, 94611, 94901, 92220, 93305,
       95134, 94612, 92507, 91730, 94501, 94303, 94105, 94550, 92612,
       95617, 92374, 94080, 94608, 93555, 93311, 94704, 92717, 92037,
       95136, 94542, 94143, 91775, 92703, 92354, 92024, 92831, 92833,
       94304, 90057, 92130, 91301, 92096, 92646, 92182, 92131, 93720,
       90840, 95035, 93010, 94928, 95831, 91770, 90007, 94102, 91423,
       93955, 94107, 92834, 93117, 94551, 94596, 94025, 94545, 95053,
       90036, 91125, 95120, 94706, 95827, 90503, 90250, 95817, 95503,
       93111, 94132, 95818, 91942, 90401, 93524, 95133, 92173, 94043,
       92521, 92122, 93118, 92697, 94577, 91345, 94123, 92152, 91355,
       94609, 94306, 96150, 94110, 94707, 91326, 90291, 92807, 95051,
       94085, 92677, 92614, 92626, 94583, 92103, 92691, 92407, 90504,
       94002, 95039, 94063, 94923, 95023, 90058, 92126, 94118, 90029,
       92806, 94806, 92110, 94536, 90623, 92069, 92843, 92120, 95605,
       90740, 91207, 95929, 93437, 90630, 90034, 90266, 95630, 93657,
       92038, 91304, 92606, 92192, 90745, 95060, 94301, 92692, 92101,
       94610, 90254, 94590, 92028, 92054, 92029, 93105, 91941, 92346,
       94402, 94618, 94904, 93077, 95482, 91709, 91311, 94509, 92866,
       91745, 94111, 94309, 90073, 92333, 90505, 94998, 94086, 94709,
       95825, 90509, 93108, 94588, 91706, 92109, 92068, 95841, 92123,
       91342, 90232, 92634, 91006, 91768, 90028, 92008, 95112, 92154,
       92115, 92177, 90640, 94607, 92780, 90009, 92518, 91007, 93014,
       94024, 90027, 95207, 90717, 94534, 94010, 91614, 94234, 90210,
       95020, 92870, 92124, 90049, 94521, 95678, 95045, 92653, 92821,
       90025, 92835, 91910, 94701, 91129, 90071, 96651, 94960, 91902,
       90033, 95621, 90037, 90005, 93940, 91109, 93009, 93561, 95126,
       94109, 93107, 94591, 92251, 92648, 92709, 91754, 92009, 96064,
       91103, 91030, 90066, 95403, 91016, 95348, 91950, 95822, 94538,
       92056, 93063, 91040, 92661, 94061, 95758, 96091, 94066, 94939,
       95138, 95762, 92064, 94708, 92106, 92116, 91302, 90048, 90405,
       92325, 91116, 92868, 90638, 90747, 93611, 95833, 91605, 92675,
       90650, 95820, 90018, 93711, 95973, 92886, 95812, 91203, 91105,
       95008, 90016, 90035, 92129, 90720, 94949, 90041, 95003, 95192,
       91101, 94126, 90230, 93101, 91365, 91367, 91763, 92660, 92104,
       91361, 90011, 90032, 95354, 94546, 92673, 95741, 95351, 92399,
       90274, 94087, 90044, 94131, 94124, 95032, 90212, 93109, 94019,
       95828, 90086, 94555, 93033, 93022, 91343, 91911, 94803, 94553,
       95211, 90304, 92084, 90601, 92704, 92350, 94705, 93401, 90502,
       94571, 95070, 92735, 95037, 95135, 94028, 96003, 91024, 90065,
       95405, 95370, 93727, 92867, 95821, 94566, 95125, 94526, 94604,
       96008, 93065, 96001, 95006, 90639, 92630, 95307, 91801, 94302,
       91710, 93950, 90059, 94108, 94558, 93933, 92161, 94507, 94575,
       95449, 93403, 93460, 95005, 93302, 94040, 91401, 95816, 92624,
       95131, 94965, 91784, 91765, 90280, 95422, 95518, 95193, 92694,
       90275, 90272, 91791, 92705, 91773, 93003, 90755, 96145, 94703,
       96094, 95842, 94116, 90068, 94970, 90813, 94404, 94598])
In [ ]:
len(df["ZIPCode"].unique()) # returns total length for Zipcodes (unique)
Out[ ]:
467
In [ ]:
df["Family"].unique() # returns unique Family of customers
Out[ ]:
array([4, 3, 1, 2])
In [ ]:
df["CCAvg"].unique() # returns unique Credit Card Average spend for all customers
Out[ ]:
array([ 1.6 ,  1.5 ,  1.  ,  2.7 ,  0.4 ,  0.3 ,  0.6 ,  8.9 ,  2.4 ,
        0.1 ,  3.8 ,  2.5 ,  2.  ,  4.7 ,  8.1 ,  0.5 ,  0.9 ,  1.2 ,
        0.7 ,  3.9 ,  0.2 ,  2.2 ,  3.3 ,  1.8 ,  2.9 ,  1.4 ,  5.  ,
        2.3 ,  1.1 ,  5.7 ,  4.5 ,  2.1 ,  8.  ,  1.7 ,  0.  ,  2.8 ,
        3.5 ,  4.  ,  2.6 ,  1.3 ,  5.6 ,  5.2 ,  3.  ,  4.6 ,  3.6 ,
        7.2 ,  1.75,  7.4 ,  2.67,  7.5 ,  6.5 ,  7.8 ,  7.9 ,  4.1 ,
        1.9 ,  4.3 ,  6.8 ,  5.1 ,  3.1 ,  0.8 ,  3.7 ,  6.2 ,  0.75,
        2.33,  4.9 ,  0.67,  3.2 ,  5.5 ,  6.9 ,  4.33,  7.3 ,  4.2 ,
        4.4 ,  6.1 ,  6.33,  6.6 ,  5.3 ,  3.4 ,  7.  ,  6.3 ,  8.3 ,
        6.  ,  1.67,  8.6 ,  7.6 ,  6.4 , 10.  ,  5.9 ,  5.4 ,  8.8 ,
        1.33,  9.  ,  6.7 ,  4.25,  6.67,  5.8 ,  4.8 ,  3.25,  5.67,
        8.5 ,  4.75,  4.67,  3.67,  8.2 ,  3.33,  5.33,  9.3 ,  2.75])
In [ ]:
df["Education"].unique() # returns unique Education of customers
Out[ ]:
array([1, 2, 3])
In [ ]:
df["Mortgage"].unique() # returns unique Mortgage of customers
Out[ ]:
array([  0, 155, 104, 134, 111, 260, 163, 159,  97, 122, 193, 198, 285,
       412, 153, 211, 207, 240, 455, 112, 336, 132, 118, 174, 126, 236,
       166, 136, 309, 103, 366, 101, 251, 276, 161, 149, 188, 116, 135,
       244, 164,  81, 315, 140,  95,  89,  90, 105, 100, 282, 209, 249,
        91,  98, 145, 150, 169, 280,  99,  78, 264, 113, 117, 325, 121,
       138,  77, 158, 109, 131, 391,  88, 129, 196, 617, 123, 167, 190,
       248,  82, 402, 360, 392, 185, 419, 270, 148, 466, 175, 147, 220,
       133, 182, 290, 125, 124, 224, 141, 119, 139, 115, 458, 172, 156,
       547, 470, 304, 221, 108, 179, 271, 378, 176,  76, 314,  87, 203,
       180, 230, 137, 152, 485, 300, 272, 144,  94, 208, 275,  83, 218,
       327, 322, 205, 227, 239,  85, 160, 364, 449,  75, 107,  92, 187,
       355, 106, 587, 214, 307, 263, 310, 127, 252, 170, 265, 177, 305,
       372,  79, 301, 232, 289, 212, 250,  84, 130, 303, 256, 259, 204,
       524, 157, 231, 287, 247, 333, 229, 357, 361, 294,  86, 329, 142,
       184, 442, 233, 215, 394, 475, 197, 228, 297, 128, 241, 437, 178,
       428, 162, 234, 257, 219, 337, 382, 397, 181, 120, 380, 200, 433,
       222, 483, 154, 171, 146, 110, 201, 277, 268, 237, 102,  93, 354,
       195, 194, 238, 226, 318, 342, 266, 114, 245, 341, 421, 359, 565,
       319, 151, 267, 601, 567, 352, 284, 199,  80, 334, 389, 186, 246,
       589, 242, 143, 323, 535, 293, 398, 343, 255, 311, 446, 223, 262,
       422, 192, 217, 168, 299, 505, 400, 165, 183, 326, 298, 569, 374,
       216, 191, 408, 406, 452, 432, 312, 477, 396, 582, 358, 213, 467,
       331, 295, 235, 635, 385, 328, 522, 496, 415, 461, 344, 206, 368,
       321, 296, 373, 292, 383, 427, 189, 202,  96, 429, 431, 286, 508,
       210, 416, 553, 403, 225, 500, 313, 410, 273, 381, 330, 345, 253,
       258, 351, 353, 308, 278, 464, 509, 243, 173, 481, 281, 306, 577,
       302, 405, 571, 581, 550, 283, 612, 590, 541])
In [ ]:
# returns unique negative Experience values
df[df["Experience"] < 0]["Experience"].unique()
Out[ ]:
array([-1, -2, -3])
In [ ]:
# Returns the count of each negative Experience value
df[df["Experience"] < 0]["Experience"].value_counts()
Out[ ]:
Experience
-1    33
-2    15
-3     4
Name: count, dtype: int64
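Before correcting the values, the claim that the negative Experience rows belong to younger customers can be checked with a simple filter, e.g. `df.loc[df["Experience"] < 0, "Age"].describe()`. A self-contained sketch on toy data mimicking the pattern (hypothetical values, not the real dataset):

```python
import pandas as pd

# Toy frame with negative Experience concentrated among young customers
toy = pd.DataFrame({"Age": [23, 25, 29, 45], "Experience": [-1, -3, -2, 20]})

# Ages of the rows where Experience is negative
neg_ages = toy.loc[toy["Experience"] < 0, "Age"]
print(neg_ages.min(), neg_ages.max())  # 23 29
```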
In [ ]:
# Correcting the Experience values - converting them to absolute values, assuming the negatives are data entry errors
df["Experience"] = df["Experience"].abs()
In [ ]:
df.describe().T # Returns Statistical Summary after Experience column is fixed.
Out[ ]:
count mean std min 25% 50% 75% max
ID 5000.0 2500.500000 1443.520003 1.0 1250.75 2500.5 3750.25 5000.0
Age 5000.0 45.338400 11.463166 23.0 35.00 45.0 55.00 67.0
Experience 5000.0 20.134600 11.415189 0.0 10.00 20.0 30.00 43.0
Income 5000.0 73.774200 46.033729 8.0 39.00 64.0 98.00 224.0
ZIPCode 5000.0 93169.257000 1759.455086 90005.0 91911.00 93437.0 94608.00 96651.0
Family 5000.0 2.396400 1.147663 1.0 1.00 2.0 3.00 4.0
CCAvg 5000.0 1.937938 1.747659 0.0 0.70 1.5 2.50 10.0
Education 5000.0 1.881000 0.839869 1.0 1.00 2.0 3.00 3.0
Mortgage 5000.0 56.498800 101.713802 0.0 0.00 0.0 101.00 635.0
Personal_Loan 5000.0 0.096000 0.294621 0.0 0.00 0.0 0.00 1.0
Securities_Account 5000.0 0.104400 0.305809 0.0 0.00 0.0 0.00 1.0
CD_Account 5000.0 0.060400 0.238250 0.0 0.00 0.0 0.00 1.0
Online 5000.0 0.596800 0.490589 0.0 0.00 1.0 1.00 1.0
CreditCard 5000.0 0.294000 0.455637 0.0 0.00 0.0 1.00 1.0

Observations¶

Customers are in the range of 23 - 67 years old

Maximum Experience is 43 years, with mean and median both about 20 years

Minimum Income is 8K and maximum is 224K; mean is 73K and median is 64K, so we may see some outliers for Income

There are 52 rows in total where Experience is negative

Maximum mortgage taken is 635K whereas the median is 0 - this points to outliers

Average monthly credit card spend ranges from 0 to 10K, with a mean a little under 2K (1.93) and a median of 1.5K

There are 467 unique ZIP codes in total

Outliers are expected for Income, Mortgage and CCAvg (may or may not need treatment)

Negative Experience values are mostly in the 23 - 29 age group and look like data entry errors

All negative Experience values are replaced by their absolute values on that assumption
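The outlier expectation for Income, Mortgage and CCAvg can be quantified with the usual 1.5*IQR rule. A minimal sketch (the helper name and demo values are illustrative, not from the notebook; on the loan data it would be applied as `iqr_outlier_count(df["Income"])` and so on):

```python
import pandas as pd

def iqr_outlier_count(s: pd.Series) -> int:
    """Count values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return int(((s < lower) | (s > upper)).sum())

# Demo on a small synthetic series with one extreme value
demo = pd.Series([1, 2, 2, 3, 3, 3, 4, 100])
print(iqr_outlier_count(demo))  # 1
```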

Sanity Checks¶

In [ ]:
# Returns percentage of customers who have a CD account
round((df[df['CD_Account'] == 1]['ID'].count()/df['ID'].nunique()) *100,2)
Out[ ]:
6.04
In [ ]:
# Returns percentage of customers who have a credit card from another bank
round((df[df['CreditCard'] == 1]['ID'].count()/df['ID'].nunique()) *100,2)
Out[ ]:
29.4
In [ ]:
# Returns percentage of customers who use internet banking facilities
round((df[df['Online'] == 1]['ID'].count()/df['ID'].nunique()) *100,2)
Out[ ]:
59.68
In [ ]:
# Returns percentage of customers who have a securities account with the bank
round((df[df['Securities_Account'] == 1]['ID'].count()/df['ID'].nunique()) *100,2)
Out[ ]:
10.44
In [ ]:
# Returns percentage of customers who accepted the personal loan
round((df[df['Personal_Loan'] == 1]['ID'].count()/df['ID'].nunique()) *100,2)
Out[ ]:
9.6
In [ ]:
# Percentage of customers whose mortgage equals exactly 1 (thousand dollars) - none do;
# to find customers with any mortgage, the condition would be df['Mortgage'] > 0
round((df[df['Mortgage'] == 1]['ID'].count()/df['ID'].nunique()) *100,2)
Out[ ]:
0.0
In [ ]:
# Returns count of customers per family size
df.groupby('Family')['ID'].count()
Out[ ]:
Family
1    1472
2    1296
3    1010
4    1222
Name: ID, dtype: int64
In [ ]:
# Returns count of customers per education level
df.groupby('Education')['ID'].count()
Out[ ]:
Education
1    2096
2    1403
3    1501
Name: ID, dtype: int64
In [ ]:
# Returns percent of customers who spend less than 5K per month
round((df[df['CCAvg'] < 5]['ID'].count()/df['ID'].nunique()) *100,2)
Out[ ]:
92.72
In [ ]:
# Returns percent of customers who spend less than 2K per month
round((df[df['CCAvg'] < 2]['ID'].count()/df['ID'].nunique()) *100,2)
Out[ ]:
61.18
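The repeated percentage computations above could be collapsed into one helper (a sketch; `pct_where` is an illustrative name, not part of the notebook):

```python
import pandas as pd

def pct_where(df: pd.DataFrame, mask: pd.Series) -> float:
    """Percentage of rows satisfying a boolean mask, rounded to 2 decimals."""
    return round(mask.sum() / len(df) * 100, 2)

# On the loan data: pct_where(df, df["CD_Account"] == 1) would give 6.04
# Demo on a toy frame
toy = pd.DataFrame({"flag": [1, 0, 1, 0]})
print(pct_where(toy, toy["flag"] == 1))  # 50.0
```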

Observations¶

Only 6.04 percent of customers have a CD account

29.4 percent of customers use credit cards issued by other banks

Approximately 60 percent of customers use online banking facilities

Only 9.6 percent of customers have taken a loan after the campaign

Only 10.44 percent of customers have a securities account

More than 90% of customers spend less than 5K per month on credit cards

More than 60% of customers spend less than 2K per month on credit cards

Exploratory Data Analysis.¶

  • EDA is an important part of any project involving data.
  • It is important to investigate and understand the data better before building a model with it.
  • A few questions have been mentioned below which will help you approach the analysis in the right manner and generate insights from the data.
  • A thorough analysis of the data, in addition to the questions mentioned below, should be done.

Questions:

  1. What is the distribution of mortgage attribute? Are there any noticeable patterns or outliers in the distribution?
  2. How many customers have credit cards?
  3. What are the attributes that have a strong correlation with the target attribute (personal loan)?
  4. How does a customer's interest in purchasing a loan vary with their age?
  5. How does a customer's interest in purchasing a loan vary with their education?
In [ ]:
# function to plot a boxplot and a histogram along the same scale.

def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12,7))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # number of rows of the subplot grid
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot; a star indicates the mean value of the column
    if bins:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins)
    else:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2)
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # add median to the histogram
In [ ]:
# function to create labeled barplots


def labeled_barplot(data, feature, hue, perc=False, n=None):
    """
    Barplot with percentage at the top

    data: dataframe
    feature: dataframe column
    hue: dataframe column used to color the bars
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """

    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 5))
    else:
        plt.figure(figsize=(n + 1, 5))

    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        hue=hue,
        order=data[feature].value_counts().index[:n].sort_values(),
    )

    for p in ax.patches:
        if perc == True:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category

        x = p.get_x() + p.get_width() / 2  # width of the plot
        y = p.get_height()  # height of the plot

        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the percentage

    plt.show()  # show the plot

Univariate Data Analysis.¶

Age¶

In [ ]:
# calls function to plot a boxplot and a histogram along the same scale for Age
histogram_boxplot(df, "Age", kde=True)
Observations¶

Age is well distributed in the dataset but shows 5 spikes

Minimum is 23 and maximum is 67

Mean and median are both 45

Experience¶

In [ ]:
# calls function to plot a boxplot and a histogram along the same scale for Experience
histogram_boxplot(df, "Experience")
Observations¶

Experience is well distributed with 4 spikes

Minimum Experience is 0 years whereas maximum Experience is 43 years

Mean and median are both close to 20

Income¶

In [ ]:
# calls function to plot a boxplot and a histogram along the same scale for Income
histogram_boxplot(df, "Income", kde = True)
In [ ]:
# returns percent of customers who have income less than 100K

df[df['Income'] < 100]['ID'].count()/df['ID'].count()
Out[ ]:
0.7556
Observations¶

Income is heavily right skewed; there are more customers with low income

About 75 percent of customers have income less than 100K

Income ranges from 8K to 224K

Max Income (224K) is much higher than Q3 (98K)

Mean is 73K whereas median is 64K (median < mean)

We see outliers for Income

CCAvg¶

In [ ]:
# calls function to plot a boxplot and a histogram along the same scale for Credit Card Spend Average
histogram_boxplot(df, "CCAvg")
In [ ]:
round((df[df['CCAvg'] > 5]['ID'].count()/df['ID'].nunique()) *100,2)
Out[ ]:
6.92
In [ ]:
round((df[df['CCAvg'] <=2  ]['ID'].count()/df['ID'].nunique()) *100,2)
Out[ ]:
64.94
Observations¶

CCAvg is heavily right skewed; 50 percent of customers spend less than 2K per month on credit cards

CCAvg ranges from 0 to 10K

Max CCAvg (10K) is much higher than Q3 (2.5K)

About 65 percent of customers spend 2K or less per month

About 7 percent of customers spend more than 5K per month

Mean is 1.93K whereas median is 1.5K

There are many outliers in CCAvg on the higher side

Mortgage¶

In [ ]:
# calls function to plot a boxplot and a histogram along the same scale for Mortgage
histogram_boxplot(df, "Mortgage")

Mortgage is heavily right skewed

Mortgage ranges from 0 to 635K

Max Mortgage (635K) is much higher than Q3 (101K)

The median Mortgage is 0

There are many outliers in Mortgage on the higher side

Education¶

In [ ]:
# categorical plot for education
labeled_barplot(df, "Education", "Personal_Loan", perc=True)

Approximately 40% of customers are undergraduates, while graduates are slightly fewer than advanced professionals

Personal loan acceptance is higher among graduates and advanced professionals

Family¶

In [ ]:
# categorical plot for Family
labeled_barplot(df, "Family", "Personal_Loan", perc=True)

Customers with a family size of 3 or 4 have a slightly higher chance of accepting a personal loan

Credit Card¶

In [ ]:
# categorical plot for CreditCard
labeled_barplot(df, "CreditCard", "Personal_Loan", perc=True)

A little less than 64% of customers neither use a credit card from another bank nor have a personal loan

Online¶

In [ ]:
# categorical plot for Online
labeled_barplot(df, "Online", "Personal_Loan", perc=True)

53.9% of customers use online banking but did not accept the personal loan (about 60% use online banking overall)

Personal Loan¶

In [ ]:
# categorical plot for Personal_Loan
labeled_barplot(df, "Personal_Loan", "Personal_Loan", perc=True)

Very few customers accepted the personal loan in the campaign

Only 9.6% of customers have accepted the personal loan
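With only 9.6% positives, the target is imbalanced; when the train/test split is made later with the imported `train_test_split`, stratifying on the target preserves this ratio in both partitions. A small sketch on a synthetic target with the same flavor (hypothetical data, not the loan dataframe):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic imbalanced target: 10% positives, like the loan data
y = pd.Series([1] * 10 + [0] * 90)
X = pd.DataFrame({"x": np.arange(100)})

# stratify=y keeps the positive rate equal in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)
print(round(y_tr.mean(), 2), round(y_te.mean(), 2))  # 0.1 0.1
```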

Securities Account¶

In [ ]:
# categorical plot for Securities Account
labeled_barplot(df, "Securities_Account", "Personal_Loan", perc=True)

Most customers do not have a securities account; only a little over 500 (522) hold one

81.2% of customers have neither a securities account nor a personal loan

CD¶

In [ ]:
# categorical plot for CD_account
labeled_barplot(df, "CD_Account", "Personal_Loan", perc=True)

Observations¶

87.2% of customers have neither a CD account nor a personal loan; about 94% have no CD account at all

CD account holders have a higher chance of accepting a personal loan

Bivariate Analysis¶

In [ ]:
def stacked_barplot(data, predictor, target):
    """
    Print the category counts and plot a stacked bar chart

    data: dataframe
    predictor: independent variable
    target: target variable
    """
    count = data[predictor].nunique()
    sorter = data[target].value_counts().index[-1]
    tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
        by=sorter, ascending=False
    )
    print(tab1)
    print("-" * 120)
    tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
        by=sorter, ascending=False
    )
    tab.plot(kind="bar", stacked=True, figsize=(count + 3, 3))
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1))  # place legend outside the plot
    plt.show()
In [ ]:
### function to plot distributions wrt target


def distribution_plot_wrt_target(data, predictor, target):

    fig, axs = plt.subplots(2, 2, figsize=(8, 6))

    target_uniq = data[target].unique()

    axs[0, 0].set_title("Distribution of " + predictor + " for " + target + "=" + str(target_uniq[0]))
    sns.histplot(
        data=data[data[target] == target_uniq[0]],
        x=predictor,
        kde=True,
        ax=axs[0, 0],
        color="teal",
        stat="density",
    )

    axs[0, 1].set_title("Distribution of " + predictor + " for " + target + "=" + str(target_uniq[1]))
    sns.histplot(
        data=data[data[target] == target_uniq[1]],
        x=predictor,
        kde=True,
        ax=axs[0, 1],
        color="orange",
        stat="density",
    )

    axs[1, 0].set_title("Boxplot w.r.t target")
    sns.boxplot(data=data, x=target, y=predictor, ax=axs[1, 0])

    axs[1, 1].set_title("Boxplot (without outliers) w.r.t target")
    sns.boxplot(
        data=data,
        x=target,
        y=predictor,
        ax=axs[1, 1],
        showfliers=False,
    )

    plt.tight_layout()
    plt.show()

Correlation check¶

In [ ]:
# correlation among variables
plt.figure(figsize=(15, 7))
sns.heatmap(df.corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral") #Returns the heatmap of the data
plt.show()
In [ ]:
# Pair plot for continuous variables
sns.pairplot(data=df, vars=['Age', 'Income', 'Mortgage', 'CCAvg' ], hue='Personal_Loan');
Observations¶

Age and Experience are highly correlated, so the Experience column can be dropped

Income and CCAvg are positively correlated

Customers with higher income are more likely to take a personal loan, as Income is positively correlated with Personal_Loan

Apart from Income and CCAvg, CD_Account is another factor to consider for Personal Loan
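The Age-Experience redundancy visible in the heatmap can also be detected programmatically. A sketch (the helper name and 0.9 threshold are illustrative choices; on the loan data this would flag the Age/Experience pair):

```python
import pandas as pd

def high_corr_pairs(df: pd.DataFrame, threshold: float = 0.9):
    """Return feature pairs whose absolute correlation exceeds threshold."""
    corr = df.corr(numeric_only=True).abs()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if corr.iloc[i, j] > threshold:
                pairs.append((cols[i], cols[j], round(corr.iloc[i, j], 2)))
    return pairs

# Demo on toy data: 'a' and 'b' are perfectly correlated, 'c' is not
toy = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [4, 1, 3, 2]})
print(high_corr_pairs(toy))  # [('a', 'b', 1.0)]
```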

Personal Loan¶

In [ ]:
# calls stacked bar plot function for different categorical parameters wrt Personal loan
stacked_barplot(df, "Family", "Personal_Loan")
stacked_barplot(df, "Education", "Personal_Loan")
stacked_barplot(df, "Securities_Account", "Personal_Loan")
stacked_barplot(df, "CD_Account", "Personal_Loan")
stacked_barplot(df, "Online", "Personal_Loan")
stacked_barplot(df, "CreditCard", "Personal_Loan")
Personal_Loan     0    1   All
Family                        
All            4520  480  5000
4              1088  134  1222
3               877  133  1010
1              1365  107  1472
2              1190  106  1296
------------------------------------------------------------------------------------------------------------------------
Personal_Loan     0    1   All
Education                     
All            4520  480  5000
3              1296  205  1501
2              1221  182  1403
1              2003   93  2096
------------------------------------------------------------------------------------------------------------------------
Personal_Loan          0    1   All
Securities_Account                 
All                 4520  480  5000
0                   4058  420  4478
1                    462   60   522
------------------------------------------------------------------------------------------------------------------------
Personal_Loan     0    1   All
CD_Account                    
All            4520  480  5000
0              4358  340  4698
1               162  140   302
------------------------------------------------------------------------------------------------------------------------
Personal_Loan     0    1   All
Online                        
All            4520  480  5000
1              2693  291  2984
0              1827  189  2016
------------------------------------------------------------------------------------------------------------------------
Personal_Loan     0    1   All
CreditCard                    
All            4520  480  5000
0              3193  337  3530
1              1327  143  1470
------------------------------------------------------------------------------------------------------------------------
In [ ]:
# calls distribution plot function for different parameters wrt Personal loan

distribution_plot_wrt_target(df, "Age", "Personal_Loan")
distribution_plot_wrt_target(df, "Experience", "Personal_Loan")
distribution_plot_wrt_target(df, "Income", "Personal_Loan")
distribution_plot_wrt_target(df, "ZIPCode", "Personal_Loan")
distribution_plot_wrt_target(df, "CCAvg", "Personal_Loan")
distribution_plot_wrt_target(df, "Mortgage", "Personal_Loan")
Observations¶

Customers with a family size of 3 or 4 have taken personal loans more often than those with a family size of 1 or 2

Customers with Advanced/Professional or Graduate education are slightly more likely to take the loan than Undergrads

Customers with higher income are more likely to take a personal loan

The majority of customers with high income (100K and above) have borrowed personal loans

Age and Experience show no clear pattern with respect to personal loans

Customers with higher mortgages are more likely to take a personal loan

Customers with higher credit card spending (CCAvg) are more likely to take a personal loan

Approx. 46% (140/302) of customers with CD accounts have borrowed a personal loan

Approx. 87% of customers (4358/5000) neither own a CD account nor have borrowed a personal loan

Use of online/internet banking facilities shows no impact on personal loan uptake

Customers who took a personal loan are less likely to use credit cards from other banks

Income vs Education, Family and CD_Account¶
In [ ]:
sns.boxplot(data=df,y='Income',x='Education',hue='Personal_Loan');

As education level increases, the median income of customers with personal loans also increases

In [ ]:
sns.boxplot(data=df,y='Income',x='Family',hue='Personal_Loan');

Income level among all Family groups is significantly higher for customers who have a Personal Loan.

There are several outliers in Family size 1 and 2 for customers who don't have a Personal loan compared to the rest.

In [ ]:
sns.boxplot(data=df,y='Income',x='CD_Account',hue='Personal_Loan');

Ignoring outliers, high-income customers are more likely to own CD accounts than low-income ones

Customers with CD accounts are more likely to borrow a personal loan

Mortgage vs Education, Family and CD_Account¶
In [ ]:
sns.boxplot(data=df,y='Mortgage',x='Family',hue='Personal_Loan');

As family size increases, customers are more likely to borrow a personal loan, ignoring outliers

In [ ]:
sns.boxplot(data=df,y='Mortgage',x='Education',hue='Personal_Loan');

Ignoring outliers, customers at education level 1 (Undergrad) have higher mortgages than those at Graduate and Advanced/Professional levels

In [ ]:
sns.boxplot(data=df,y='Mortgage',x='CD_Account',hue='Personal_Loan');
Observations¶

People with higher income have taken personal loans

People with 2-4 family members are more likely to take a personal loan

People with high mortgages opted for personal loans

People with higher average credit card spending opted for personal loans

More customers with Advanced/Professional education have borrowed personal loans than Graduates and Undergrads

More customers with a family size of 3 or more have borrowed personal loans than other customers

60 of the customers who took a personal loan with the bank also had a Securities Account.

Almost 50% (140/302) of customers with a Certificate of Deposit (CD) account had borrowed a personal loan. However, 4358 out of 5000 customers have neither a CD account nor a personal loan, which suggests that a customer without a CD account is unlikely to take a personal loan.

The majority of customers who took a personal loan with the bank did not use credit cards from other banks.

Observations on EDA¶

Questions:

What is the distribution of mortgage attribute?

Mortgage is heavily right-skewed. It ranges from 0 to 635K; the maximum (635K) is much higher than Q3 (101K); the median is 0; and there are many outliers on the higher side.

Are there any noticeable patterns or outliers in the distribution?

Yes, we see outliers for Income, Mortgage, and CCAvg. Income and CCAvg are highly positively correlated (0.67): customers spend more when their income is high.

How many customers have credit cards?

A total of 1470 customers have a credit card from another bank. Of these 1470, 1327 do not have a personal loan and 143 do.

What are the attributes that have a strong correlation with the target attribute (personal loan)?

The strongest positive correlation with Personal_Loan is Income (close to 0.5), followed by CCAvg (0.37).

How does a customer's interest in purchasing a loan vary with their age?

Age has no correlation with purchasing a personal loan.

How does a customer's interest in purchasing a loan vary with their education? Customers with Advanced/Professional or Graduate education are slightly more likely to take the loan than Undergrads.

Customers with Advanced/Professional education (3) have an approx. 13.7% acceptance rate (205/1501); Graduates (2) approx. 13.0% (182/1403); Undergrads (1) approx. 4.4% (93/2096).
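Acceptance rates like these can be reproduced with a row-normalized crosstab. A minimal sketch on a tiny synthetic sample (in the notebook this would use `df`'s Education and Personal_Loan columns):

```python
import pandas as pd

# Tiny synthetic sample (the notebook's df has 5000 rows; values here are illustrative)
sample = pd.DataFrame({
    "Education":     [1, 1, 1, 1, 2, 2, 2, 3, 3, 3],
    "Personal_Loan": [0, 0, 0, 1, 0, 0, 1, 0, 1, 1],
})

# Row-normalized crosstab: the acceptance rate within each education level
rates = pd.crosstab(sample["Education"], sample["Personal_Loan"], normalize="index")
print(rates[1])  # column 1 = share of customers who took the loan
```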

Data Preprocessing¶

  • Missing value treatment
  • Feature engineering (if needed)
  • Outlier detection and treatment (if needed)
  • Preparing data for modeling
  • Any other preprocessing steps (if needed)
In [ ]:
dfcopy= df.copy() # create a copy of data in case we need to restore
dfcopy.info() # Test if copy is all good
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   ID                  5000 non-null   int64  
 1   Age                 5000 non-null   int64  
 2   Experience          5000 non-null   int64  
 3   Income              5000 non-null   int64  
 4   ZIPCode             5000 non-null   int64  
 5   Family              5000 non-null   int64  
 6   CCAvg               5000 non-null   float64
 7   Education           5000 non-null   int64  
 8   Mortgage            5000 non-null   int64  
 9   Personal_Loan       5000 non-null   int64  
 10  Securities_Account  5000 non-null   int64  
 11  CD_Account          5000 non-null   int64  
 12  Online              5000 non-null   int64  
 13  CreditCard          5000 non-null   int64  
dtypes: float64(1), int64(13)
memory usage: 547.0 KB

Drop Data Columns¶

In [ ]:
# Dropping ID as it is a unique identifier with no predictive value
# Dropping Experience as Age and Experience are highly correlated
dfcopy.drop(['Experience', 'ID'], axis=1, inplace=True)
# Dropping ZIPCode as it does not provide much insight
dfcopy.drop(['ZIPCode'], axis=1, inplace=True)
In [ ]:
dfcopy.describe()
Out[ ]:
Age Income Family CCAvg Education Mortgage Personal_Loan Securities_Account CD_Account Online CreditCard
count 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5000.00000 5000.000000 5000.000000
mean 45.338400 73.774200 2.396400 1.937938 1.881000 56.498800 0.096000 0.104400 0.06040 0.596800 0.294000
std 11.463166 46.033729 1.147663 1.747659 0.839869 101.713802 0.294621 0.305809 0.23825 0.490589 0.455637
min 23.000000 8.000000 1.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000
25% 35.000000 39.000000 1.000000 0.700000 1.000000 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000
50% 45.000000 64.000000 2.000000 1.500000 2.000000 0.000000 0.000000 0.000000 0.00000 1.000000 0.000000
75% 55.000000 98.000000 3.000000 2.500000 3.000000 101.000000 0.000000 0.000000 0.00000 1.000000 1.000000
max 67.000000 224.000000 4.000000 10.000000 3.000000 635.000000 1.000000 1.000000 1.00000 1.000000 1.000000

Outlier Detection¶

In [ ]:
# outlier detection using boxplot

numeric_columns = dfcopy.select_dtypes(include=np.number).columns.tolist()

plt.figure(figsize=(15, 12))

for i, variable in enumerate(numeric_columns):
    plt.subplot(4, 4, i + 1)
    plt.boxplot(dfcopy[variable], whis=1.5)
    plt.tight_layout()
    plt.title(variable)

plt.show()
In [ ]:
# functions to treat outliers

def treat_outliers(data, column):
    """
    Treats outliers in a variable

    data: dataframe
    column: dataframe column
    """
    Q1 = data[column].quantile(0.25)  # 25th quantile
    Q3 = data[column].quantile(0.75)  # 75th quantile
    IQR = Q3 - Q1
    Lower_Whisker = Q1 - 1.5 * IQR
    Upper_Whisker = Q3 + 1.5 * IQR

    # all the values smaller than Lower_Whisker will be assigned the value of Lower_Whisker
    # all the values greater than Upper_Whisker will be assigned the value of Upper_Whisker

    data[column] = np.clip(data[column], Lower_Whisker, Upper_Whisker)

    return data
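As a quick sanity check of the clipping logic, applied to a toy column (the function is restated so the snippet is self-contained; the values are synthetic, not the bank data):

```python
import numpy as np
import pandas as pd

def treat_outliers(data, column):
    """Clip values outside the 1.5 * IQR whiskers (same logic as above)."""
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1
    data[column] = np.clip(data[column], Q1 - 1.5 * IQR, Q3 + 1.5 * IQR)
    return data

toy = pd.DataFrame({"x": [1, 2, 3, 4, 100]})  # 100 is an obvious outlier
treat_outliers(toy, "x")
print(toy["x"].max())  # clipped to the upper whisker: Q3 + 1.5 * IQR = 4 + 3 = 7
```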
In [ ]:
# Treat outliers in a list of variables
def treat_outliers_all(data, col_list):
    """
    Treat outliers in a list of variables

    data: dataframe
    col_list: list of dataframe columns
    """
    for col in col_list:
        data = treat_outliers(data, col)

    return data
In [ ]:
# The following code needs to be executed only if we decide to treat all outliers. For now, we keep the outliers.
#numerical_col = dfcopy.select_dtypes(include=np.number).columns.tolist()
#data = treat_outliers_all(dfcopy, numerical_col)

Observations¶

  • There are quite a lot of outliers in the data, mainly in Income, Mortgage, and CCAvg
  • However, we will not treat them, as they are legitimate values rather than data errors

Data Preparation for Modeling¶

In [ ]:
# Create dummy variables
# Use one Hot Encoding for columns where value is 0 and 1 and for Family and Education
dummy_data = pd.get_dummies(dfcopy, columns=["Education", 'Securities_Account','CD_Account','Online' , 'CreditCard','Family'], drop_first=True)
dummy_data.head()
Out[ ]:
Age Income CCAvg Mortgage Personal_Loan Education_2 Education_3 Securities_Account_1 CD_Account_1 Online_1 CreditCard_1 Family_2 Family_3 Family_4
0 25 49 1.6 0 0 False False True False False False False False True
1 45 34 1.5 0 0 False False True False False False False True False
2 39 11 1.0 0 0 False False False False False False False False False
3 35 100 2.7 0 0 True False False False False False False False False
4 35 45 1.0 0 0 True False False False False True False False True
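The effect of drop_first=True, visible in the columns above (there is no Education_1 or Family_1), is that the first category becomes the implicit baseline. A toy illustration:

```python
import pandas as pd

# drop_first=True drops the first category, making it the implicit baseline
toy = pd.DataFrame({"Education": [1, 2, 3, 1]})
dummies = pd.get_dummies(toy, columns=["Education"], drop_first=True)
print(list(dummies.columns))  # no Education_1 column
```

Dropping the first level avoids redundant columns: a row with Education_2 = Education_3 = False must be Education 1.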
In [ ]:
# We will split the dataset into dependent and independent variable sets
X = dummy_data.drop(["Personal_Loan"], axis=1)
Y = dummy_data["Personal_Loan"]
In [ ]:
X.info() # returns a summary of the independent features in the dataset
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Age                   5000 non-null   int64  
 1   Income                5000 non-null   int64  
 2   CCAvg                 5000 non-null   float64
 3   Mortgage              5000 non-null   int64  
 4   Education_2           5000 non-null   bool   
 5   Education_3           5000 non-null   bool   
 6   Securities_Account_1  5000 non-null   bool   
 7   CD_Account_1          5000 non-null   bool   
 8   Online_1              5000 non-null   bool   
 9   CreditCard_1          5000 non-null   bool   
 10  Family_2              5000 non-null   bool   
 11  Family_3              5000 non-null   bool   
 12  Family_4              5000 non-null   bool   
dtypes: bool(9), float64(1), int64(3)
memory usage: 200.3 KB
In [ ]:
Y.info() # returns a summary of the dependent variable, Personal_Loan
<class 'pandas.core.series.Series'>
RangeIndex: 5000 entries, 0 to 4999
Series name: Personal_Loan
Non-Null Count  Dtype
--------------  -----
5000 non-null   int64
dtypes: int64(1)
memory usage: 39.2 KB
In [ ]:
# Splitting data in train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.30, random_state=1)
In [ ]:
# Print summary for train and test data set
print("Shape of Training set : ", X_train.shape)
print("Shape of test set : ", X_test.shape)
print("Percentage of classes in training set:")
print(y_train.value_counts(normalize=True)*100)
print("Percentage of classes in test set:")
print(y_test.value_counts(normalize=True)*100)
Shape of Training set :  (3500, 13)
Shape of test set :  (1500, 13)
Percentage of classes in training set:
Personal_Loan
0    90.542857
1     9.457143
Name: proportion, dtype: float64
Percentage of classes in test set:
Personal_Loan
0    90.066667
1     9.933333
Name: proportion, dtype: float64

Observations¶

We have split the dataset into Training and Testing dataset.

In both datasets, the target variable has around 90:10 distribution for the values 0 and 1 respectively.
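Since the classes are imbalanced (~90:10), passing stratify=Y to train_test_split would lock that ratio exactly into both splits rather than leaving it to chance. A sketch on synthetic labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)  # 90:10 imbalance like the bank data

# stratify=y keeps the 90:10 ratio exact in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)
print(y_tr.mean(), y_te.mean())
```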

Model Building¶

Model Evaluation Criterion¶

Case Predictions:

Case 1 - Predicting a customer will buy a loan when in reality they do not - Loss of Resources (False Positive, FP)

Case 2 - Predicting a customer will not buy a loan when in reality they do - Loss of Opportunity (False Negative, FN)

Which case is more important?

The purpose of the loan campaign is to bring in more customers. Missing a potential buyer is the costlier mistake here, so Case 2 is more important.

How do we reduce this loss of opportunity, i.e., False Negatives (FN)?

The bank would want to maximize recall: the greater the recall score, the lower the chance of false negatives.
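Recall is TP / (TP + FN), so every false negative pulls it down. A quick check with sklearn's recall_score:

```python
from sklearn.metrics import recall_score

y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0]  # 2 true positives, 2 false negatives
r = recall_score(y_true, y_pred)
print(r)  # 2 / (2 + 2) = 0.5
```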

In [ ]:
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    """

    # predicting using the independent variables
    pred = model.predict(predictors)

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1,},
        index=[0],
    )

    return df_perf
In [ ]:
def confusion_matrix_sklearn(model, predictors, target):
    """
    To plot the confusion_matrix with percentages

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)

    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
In [ ]:
# List all features in a list called feature_names
feature_names = list(X_train.columns)
print(feature_names)
['Age', 'Income', 'CCAvg', 'Mortgage', 'Education_2', 'Education_3', 'Securities_Account_1', 'CD_Account_1', 'Online_1', 'CreditCard_1', 'Family_2', 'Family_3', 'Family_4']

Default Decision Tree¶

Build Default Decision Tree Model

Build tree using DecisionTreeClassifier function without class weights

In [ ]:
# Build a default Decision Tree
model = DecisionTreeClassifier(criterion = 'gini', random_state=1)
model.fit(X_train, y_train)
Out[ ]:
DecisionTreeClassifier(random_state=1)
In [ ]:
# Build Confusion matrix for train and test model without weights
confusion_matrix_sklearn(model, X_train, y_train)
confusion_matrix_sklearn(model, X_test, y_test)
In [ ]:
# Print different metrics for default tree for training data
decision_tree_perf_train_without = model_performance_classification_sklearn(
    model, X_train, y_train
)
decision_tree_perf_train_without
Out[ ]:
Accuracy Recall Precision F1
0 1.0 1.0 1.0 1.0
In [ ]:
# Print different metrics for default tree for testing data
decision_tree_perf_test_without = model_performance_classification_sklearn(
    model, X_test, y_test
)
decision_tree_perf_test_without
Out[ ]:
Accuracy Recall Precision F1
0 0.979333 0.899329 0.893333 0.896321

Visualizing the Default Decision Tree¶

In [ ]:
# Check the depth of the tree
model.get_depth()
Out[ ]:
10
In [ ]:
# Shows the tree
plt.figure(figsize=(20, 30))
out = tree.plot_tree(
    model,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=False,
    class_names=None,
)
# below code will add arrows to the decision tree split if they are missing
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
In [ ]:
# Report showing the rules of a decision tree
print(tree.export_text(model, feature_names=feature_names, show_weights=True))
|--- Income <= 116.50
|   |--- CCAvg <= 2.95
|   |   |--- Income <= 106.50
|   |   |   |--- weights: [2553.00, 0.00] class: 0
|   |   |--- Income >  106.50
|   |   |   |--- Family_4 <= 0.50
|   |   |   |   |--- Education_3 <= 0.50
|   |   |   |   |   |--- Age <= 28.50
|   |   |   |   |   |   |--- Education_2 <= 0.50
|   |   |   |   |   |   |   |--- weights: [5.00, 0.00] class: 0
|   |   |   |   |   |   |--- Education_2 >  0.50
|   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |--- Age >  28.50
|   |   |   |   |   |   |--- weights: [43.00, 0.00] class: 0
|   |   |   |   |--- Education_3 >  0.50
|   |   |   |   |   |--- Mortgage <= 231.00
|   |   |   |   |   |   |--- CCAvg <= 1.95
|   |   |   |   |   |   |   |--- weights: [14.00, 0.00] class: 0
|   |   |   |   |   |   |--- CCAvg >  1.95
|   |   |   |   |   |   |   |--- CCAvg <= 2.65
|   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |--- CCAvg >  2.65
|   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |--- Mortgage >  231.00
|   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |--- Family_4 >  0.50
|   |   |   |   |--- Age <= 32.50
|   |   |   |   |   |--- CCAvg <= 2.40
|   |   |   |   |   |   |--- weights: [12.00, 0.00] class: 0
|   |   |   |   |   |--- CCAvg >  2.40
|   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |--- Age >  32.50
|   |   |   |   |   |--- Age <= 60.00
|   |   |   |   |   |   |--- weights: [0.00, 6.00] class: 1
|   |   |   |   |   |--- Age >  60.00
|   |   |   |   |   |   |--- weights: [4.00, 0.00] class: 0
|   |--- CCAvg >  2.95
|   |   |--- Income <= 92.50
|   |   |   |--- CD_Account_1 <= 0.50
|   |   |   |   |--- Age <= 26.50
|   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |--- Age >  26.50
|   |   |   |   |   |--- CCAvg <= 3.55
|   |   |   |   |   |   |--- CCAvg <= 3.35
|   |   |   |   |   |   |   |--- Age <= 37.50
|   |   |   |   |   |   |   |   |--- Education_2 <= 0.50
|   |   |   |   |   |   |   |   |   |--- weights: [3.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- Education_2 >  0.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |--- Age >  37.50
|   |   |   |   |   |   |   |   |--- Income <= 82.50
|   |   |   |   |   |   |   |   |   |--- weights: [23.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- Income >  82.50
|   |   |   |   |   |   |   |   |   |--- Income <= 83.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |   |   |--- Income >  83.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [5.00, 0.00] class: 0
|   |   |   |   |   |   |--- CCAvg >  3.35
|   |   |   |   |   |   |   |--- Family_4 <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 5.00] class: 1
|   |   |   |   |   |   |   |--- Family_4 >  0.50
|   |   |   |   |   |   |   |   |--- weights: [9.00, 0.00] class: 0
|   |   |   |   |   |--- CCAvg >  3.55
|   |   |   |   |   |   |--- Family_3 <= 0.50
|   |   |   |   |   |   |   |--- Mortgage <= 93.50
|   |   |   |   |   |   |   |   |--- weights: [53.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- Mortgage >  93.50
|   |   |   |   |   |   |   |   |--- Mortgage <= 99.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |   |--- Mortgage >  99.50
|   |   |   |   |   |   |   |   |   |--- weights: [21.00, 0.00] class: 0
|   |   |   |   |   |   |--- Family_3 >  0.50
|   |   |   |   |   |   |   |--- Online_1 <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [3.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- Online_1 >  0.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |--- CD_Account_1 >  0.50
|   |   |   |   |--- weights: [0.00, 5.00] class: 1
|   |   |--- Income >  92.50
|   |   |   |--- Education_3 <= 0.50
|   |   |   |   |--- Education_2 <= 0.50
|   |   |   |   |   |--- CD_Account_1 <= 0.50
|   |   |   |   |   |   |--- Family_4 <= 0.50
|   |   |   |   |   |   |   |--- Online_1 <= 0.50
|   |   |   |   |   |   |   |   |--- Income <= 102.00
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |   |--- Income >  102.00
|   |   |   |   |   |   |   |   |   |--- Family_3 <= 0.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [12.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- Family_3 >  0.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |--- Online_1 >  0.50
|   |   |   |   |   |   |   |   |--- weights: [20.00, 0.00] class: 0
|   |   |   |   |   |   |--- Family_4 >  0.50
|   |   |   |   |   |   |   |--- CCAvg <= 4.20
|   |   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |   |   |   |--- CCAvg >  4.20
|   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |--- CD_Account_1 >  0.50
|   |   |   |   |   |   |--- Income <= 93.50
|   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |   |--- Income >  93.50
|   |   |   |   |   |   |   |--- weights: [0.00, 5.00] class: 1
|   |   |   |   |--- Education_2 >  0.50
|   |   |   |   |   |--- Age <= 60.50
|   |   |   |   |   |   |--- weights: [0.00, 10.00] class: 1
|   |   |   |   |   |--- Age >  60.50
|   |   |   |   |   |   |--- weights: [4.00, 0.00] class: 0
|   |   |   |--- Education_3 >  0.50
|   |   |   |   |--- Mortgage <= 172.00
|   |   |   |   |   |--- CD_Account_1 <= 0.50
|   |   |   |   |   |   |--- weights: [0.00, 14.00] class: 1
|   |   |   |   |   |--- CD_Account_1 >  0.50
|   |   |   |   |   |   |--- Family_3 <= 0.50
|   |   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |   |   |--- Family_3 >  0.50
|   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |--- Mortgage >  172.00
|   |   |   |   |   |--- Income <= 100.00
|   |   |   |   |   |   |--- Family_3 <= 0.50
|   |   |   |   |   |   |   |--- weights: [5.00, 0.00] class: 0
|   |   |   |   |   |   |--- Family_3 >  0.50
|   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |--- Income >  100.00
|   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|--- Income >  116.50
|   |--- Education_3 <= 0.50
|   |   |--- Education_2 <= 0.50
|   |   |   |--- Family_3 <= 0.50
|   |   |   |   |--- Family_4 <= 0.50
|   |   |   |   |   |--- weights: [375.00, 0.00] class: 0
|   |   |   |   |--- Family_4 >  0.50
|   |   |   |   |   |--- weights: [0.00, 14.00] class: 1
|   |   |   |--- Family_3 >  0.50
|   |   |   |   |--- weights: [0.00, 33.00] class: 1
|   |   |--- Education_2 >  0.50
|   |   |   |--- weights: [0.00, 108.00] class: 1
|   |--- Education_3 >  0.50
|   |   |--- weights: [0.00, 114.00] class: 1

In [ ]:
# importance of features in the tree building ( The importance of a feature is computed as the
# (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance )
print(
    pd.DataFrame(
        model.feature_importances_, columns=["Imp"], index=X_train.columns
    ).sort_values(by="Imp", ascending=False)
)
                           Imp
Income                0.309934
Education_2           0.240847
Education_3           0.165787
Family_3              0.103121
Family_4              0.062971
CCAvg                 0.050616
Age                   0.026589
CD_Account_1          0.026348
Mortgage              0.010725
Online_1              0.003063
Securities_Account_1  0.000000
CreditCard_1          0.000000
Family_2              0.000000
In [ ]:
#Visualizes the important features
importances = model.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(6, 6))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="green", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()

Observations¶

The model classifies every data point in the training set correctly.

With no errors on the training data, both recall and precision are 100%.

As we know, a decision tree with no restrictions keeps growing until it classifies every training point correctly, learning all the patterns in the training set.

This generally leads to overfitting, since the decision tree performs best on the training set.

The gap confirms it here: 100% recall on the training set versus 89.93% recall on the test set.

Per the default decision tree model without class weights, Income is the most important feature, followed by Education and Family.
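The usual remedy for this overfitting is to restrict the tree's growth, e.g. via max_depth. A sketch on synthetic data (make_classification and its parameters here are illustrative, not the bank data):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for the bank sample
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=1)

shallow = DecisionTreeClassifier(max_depth=3, random_state=1).fit(X, y)
deep = DecisionTreeClassifier(random_state=1).fit(X, y)
print(shallow.get_depth(), deep.get_depth())  # the unrestricted tree grows deeper
```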

Default Decision tree with Class weights¶
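Instead of hand-picking weights such as {0: 10, 1: 90}, class_weight="balanced" derives them from class frequencies as n_samples / (n_classes * class_count). A sketch:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0] * 90 + [1] * 10)  # 90:10 imbalance

# balanced weight = n_samples / (n_classes * class_count)
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)
print(dict(zip([0, 1], weights)))
```

For this 90:10 split the minority class gets weight 5.0 and the majority ~0.56, so misclassifying a loan taker costs roughly nine times more.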

In [ ]:
# Build a Decision Tree with class weights
weightmodel = DecisionTreeClassifier(criterion = 'gini', class_weight = {0:10,1:90}, random_state=1)
weightmodel.fit(X_train, y_train)
Out[ ]:
DecisionTreeClassifier(class_weight={0: 10, 1: 90}, random_state=1)
In [ ]:
# Build Confusion matrix for train and test model with weights
confusion_matrix_sklearn(weightmodel, X_train, y_train)
confusion_matrix_sklearn(weightmodel, X_test, y_test)
In [ ]:
# Print different metrics for default tree with class weight for training data
decision_tree_perf_train = model_performance_classification_sklearn(
    weightmodel, X_train, y_train
)
decision_tree_perf_train
Out[ ]:
Accuracy Recall Precision F1
0 1.0 1.0 1.0 1.0
In [ ]:
# Print different metrics for default tree with class weight for testing data
decision_tree_perf_test = model_performance_classification_sklearn(
    weightmodel, X_test, y_test
)
decision_tree_perf_test
Out[ ]:
Accuracy Recall Precision F1
0 0.978 0.85906 0.914286 0.885813

Visualizing the Default Decision Tree with Weights¶

In [ ]:
# Check the depth of the tree
weightmodel.get_depth()
Out[ ]:
15
In [ ]:
# Shows the tree
plt.figure(figsize=(20, 30))
out = tree.plot_tree(
    weightmodel,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=False,
    class_names=None,
)
# below code will add arrows to the decision tree split if they are missing
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
In [ ]:
# Report showing the rules of a decision tree -

print(tree.export_text(weightmodel, feature_names=feature_names, show_weights=True))
|--- Income <= 92.50
|   |--- CCAvg <= 2.95
|   |   |--- weights: [24350.00, 0.00] class: 0
|   |--- CCAvg >  2.95
|   |   |--- CD_Account_1 <= 0.50
|   |   |   |--- CCAvg <= 3.95
|   |   |   |   |--- Mortgage <= 102.50
|   |   |   |   |   |--- CCAvg <= 3.05
|   |   |   |   |   |   |--- weights: [150.00, 0.00] class: 0
|   |   |   |   |   |--- CCAvg >  3.05
|   |   |   |   |   |   |--- Family_4 <= 0.50
|   |   |   |   |   |   |   |--- Income <= 67.00
|   |   |   |   |   |   |   |   |--- weights: [80.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- Income >  67.00
|   |   |   |   |   |   |   |   |--- Securities_Account_1 <= 0.50
|   |   |   |   |   |   |   |   |   |--- Income <= 84.00
|   |   |   |   |   |   |   |   |   |   |--- Income <= 82.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 5
|   |   |   |   |   |   |   |   |   |   |--- Income >  82.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 450.00] class: 1
|   |   |   |   |   |   |   |   |   |--- Income >  84.00
|   |   |   |   |   |   |   |   |   |   |--- Income <= 90.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [50.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |   |--- Income >  90.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 2
|   |   |   |   |   |   |   |   |--- Securities_Account_1 >  0.50
|   |   |   |   |   |   |   |   |   |--- weights: [40.00, 0.00] class: 0
|   |   |   |   |   |   |--- Family_4 >  0.50
|   |   |   |   |   |   |   |--- weights: [130.00, 0.00] class: 0
|   |   |   |   |--- Mortgage >  102.50
|   |   |   |   |   |--- weights: [210.00, 0.00] class: 0
|   |   |   |--- CCAvg >  3.95
|   |   |   |   |--- weights: [420.00, 0.00] class: 0
|   |   |--- CD_Account_1 >  0.50
|   |   |   |--- weights: [0.00, 450.00] class: 1
|--- Income >  92.50
|   |--- Education_3 <= 0.50
|   |   |--- Education_2 <= 0.50
|   |   |   |--- Family_3 <= 0.50
|   |   |   |   |--- Family_4 <= 0.50
|   |   |   |   |   |--- Income <= 103.50
|   |   |   |   |   |   |--- CCAvg <= 3.21
|   |   |   |   |   |   |   |--- weights: [400.00, 0.00] class: 0
|   |   |   |   |   |   |--- CCAvg >  3.21
|   |   |   |   |   |   |   |--- Income <= 97.00
|   |   |   |   |   |   |   |   |--- weights: [30.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- Income >  97.00
|   |   |   |   |   |   |   |   |--- Age <= 39.00
|   |   |   |   |   |   |   |   |   |--- weights: [20.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- Age >  39.00
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 270.00] class: 1
|   |   |   |   |   |--- Income >  103.50
|   |   |   |   |   |   |--- weights: [4330.00, 0.00] class: 0
|   |   |   |   |--- Family_4 >  0.50
|   |   |   |   |   |--- Income <= 93.50
|   |   |   |   |   |   |--- weights: [10.00, 0.00] class: 0
|   |   |   |   |   |--- Income >  93.50
|   |   |   |   |   |   |--- Income <= 102.00
|   |   |   |   |   |   |   |--- CreditCard_1 <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [10.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- CreditCard_1 >  0.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 90.00] class: 1
|   |   |   |   |   |   |--- Income >  102.00
|   |   |   |   |   |   |   |--- weights: [0.00, 1710.00] class: 1
|   |   |   |--- Family_3 >  0.50
|   |   |   |   |--- Income <= 108.50
|   |   |   |   |   |--- weights: [110.00, 0.00] class: 0
|   |   |   |   |--- Income >  108.50
|   |   |   |   |   |--- Age <= 26.00
|   |   |   |   |   |   |--- weights: [10.00, 0.00] class: 0
|   |   |   |   |   |--- Age >  26.00
|   |   |   |   |   |   |--- Income <= 118.00
|   |   |   |   |   |   |   |--- Online_1 <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 180.00] class: 1
|   |   |   |   |   |   |   |--- Online_1 >  0.50
|   |   |   |   |   |   |   |   |--- weights: [20.00, 0.00] class: 0
|   |   |   |   |   |   |--- Income >  118.00
|   |   |   |   |   |   |   |--- weights: [0.00, 2970.00] class: 1
|   |   |--- Education_2 >  0.50
|   |   |   |--- Income <= 110.50
|   |   |   |   |--- CCAvg <= 2.90
|   |   |   |   |   |--- Income <= 106.50
|   |   |   |   |   |   |--- weights: [400.00, 0.00] class: 0
|   |   |   |   |   |--- Income >  106.50
|   |   |   |   |   |   |--- Age <= 52.00
|   |   |   |   |   |   |   |--- weights: [50.00, 0.00] class: 0
|   |   |   |   |   |   |--- Age >  52.00
|   |   |   |   |   |   |   |--- Family_3 <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 90.00] class: 1
|   |   |   |   |   |   |   |--- Family_3 >  0.50
|   |   |   |   |   |   |   |   |--- weights: [20.00, 0.00] class: 0
|   |   |   |   |--- CCAvg >  2.90
|   |   |   |   |   |--- Age <= 55.00
|   |   |   |   |   |   |--- weights: [0.00, 540.00] class: 1
|   |   |   |   |   |--- Age >  55.00
|   |   |   |   |   |   |--- weights: [20.00, 0.00] class: 0
|   |   |   |--- Income >  110.50
|   |   |   |   |--- Income <= 116.50
|   |   |   |   |   |--- Mortgage <= 141.50
|   |   |   |   |   |   |--- Age <= 60.50
|   |   |   |   |   |   |   |--- CCAvg <= 1.20
|   |   |   |   |   |   |   |   |--- weights: [20.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- CCAvg >  1.20
|   |   |   |   |   |   |   |   |--- CCAvg <= 2.65
|   |   |   |   |   |   |   |   |   |--- CCAvg <= 1.75
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 180.00] class: 1
|   |   |   |   |   |   |   |   |   |--- CCAvg >  1.75
|   |   |   |   |   |   |   |   |   |   |--- weights: [30.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- CCAvg >  2.65
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 450.00] class: 1
|   |   |   |   |   |   |--- Age >  60.50
|   |   |   |   |   |   |   |--- weights: [30.00, 0.00] class: 0
|   |   |   |   |   |--- Mortgage >  141.50
|   |   |   |   |   |   |--- weights: [40.00, 0.00] class: 0
|   |   |   |   |--- Income >  116.50
|   |   |   |   |   |--- weights: [0.00, 9720.00] class: 1
|   |--- Education_3 >  0.50
|   |   |--- Income <= 116.50
|   |   |   |--- CCAvg <= 2.35
|   |   |   |   |--- Mortgage <= 236.00
|   |   |   |   |   |--- weights: [400.00, 0.00] class: 0
|   |   |   |   |--- Mortgage >  236.00
|   |   |   |   |   |--- Income <= 110.00
|   |   |   |   |   |   |--- weights: [40.00, 0.00] class: 0
|   |   |   |   |   |--- Income >  110.00
|   |   |   |   |   |   |--- Age <= 34.50
|   |   |   |   |   |   |   |--- weights: [10.00, 0.00] class: 0
|   |   |   |   |   |   |--- Age >  34.50
|   |   |   |   |   |   |   |--- weights: [0.00, 180.00] class: 1
|   |   |   |--- CCAvg >  2.35
|   |   |   |   |--- Age <= 64.00
|   |   |   |   |   |--- CCAvg <= 2.95
|   |   |   |   |   |   |--- CCAvg <= 2.55
|   |   |   |   |   |   |   |--- CreditCard_1 <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 180.00] class: 1
|   |   |   |   |   |   |   |--- CreditCard_1 >  0.50
|   |   |   |   |   |   |   |   |--- weights: [10.00, 0.00] class: 0
|   |   |   |   |   |   |--- CCAvg >  2.55
|   |   |   |   |   |   |   |--- weights: [60.00, 0.00] class: 0
|   |   |   |   |   |--- CCAvg >  2.95
|   |   |   |   |   |   |--- Mortgage <= 172.00
|   |   |   |   |   |   |   |--- CD_Account_1 <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 1260.00] class: 1
|   |   |   |   |   |   |   |--- CD_Account_1 >  0.50
|   |   |   |   |   |   |   |   |--- Age <= 51.50
|   |   |   |   |   |   |   |   |   |--- weights: [20.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- Age >  51.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 90.00] class: 1
|   |   |   |   |   |   |--- Mortgage >  172.00
|   |   |   |   |   |   |   |--- Age <= 39.00
|   |   |   |   |   |   |   |   |--- weights: [30.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- Age >  39.00
|   |   |   |   |   |   |   |   |--- Mortgage <= 199.00
|   |   |   |   |   |   |   |   |   |--- weights: [10.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- Mortgage >  199.00
|   |   |   |   |   |   |   |   |   |--- Income <= 97.00
|   |   |   |   |   |   |   |   |   |   |--- weights: [10.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- Income >  97.00
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 270.00] class: 1
|   |   |   |   |--- Age >  64.00
|   |   |   |   |   |--- weights: [30.00, 0.00] class: 0
|   |   |--- Income >  116.50
|   |   |   |--- weights: [0.00, 10260.00] class: 1

In [ ]:
# Importance of features in tree building. A feature's importance is the
# (normalized) total reduction of the criterion brought by that feature, also known as the Gini importance.

print(
    pd.DataFrame(
        weightmodel.feature_importances_, columns=["Imp"], index=X_train.columns
    ).sort_values(by="Imp", ascending=False)
)
                           Imp
Income                0.633250
CCAvg                 0.094742
Family_4              0.080839
Education_2           0.066285
Family_3              0.063748
Education_3           0.023947
Mortgage              0.013352
Age                   0.011674
CD_Account_1          0.007908
Securities_Account_1  0.001879
CreditCard_1          0.001203
Online_1              0.001172
Family_2              0.000000
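The "total reduction of the criterion" behind these numbers can be computed by hand for a single split. A minimal sketch with hypothetical node counts (not taken from this dataset):

```python
# Gini impurity of a binary node, and the weighted impurity decrease a split
# contributes to its feature's importance. All counts here are hypothetical.
def gini(pos, neg):
    """Gini impurity of a node with `pos` positive and `neg` negative samples."""
    n = pos + neg
    if n == 0:
        return 0.0
    p = pos / n
    return 2 * p * (1 - p)

# A hypothetical parent node split into two children on some feature.
parent = (40, 60)               # (positives, negatives)
left, right = (35, 5), (5, 55)

n_parent, n_left, n_right = sum(parent), sum(left), sum(right)

# Impurity decrease credited to the splitting feature; summing these over all
# of a feature's splits (then normalizing) yields an importance table like the
# one above.
decrease = (
    gini(*parent)
    - (n_left / n_parent) * gini(*left)
    - (n_right / n_parent) * gini(*right)
)
print(round(decrease, 4))  # 0.3008
```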
In [ ]:
# Visualize the feature importances
importances = weightmodel.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(6, 6))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="green", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()

Observations¶

In the default decision tree model with class weights, Income is the most important feature, followed by CCAvg and Family size.

The tree is very complex and appears to be overfitting: the recall score is 100% on the training set but only 85.90% on the test set.

Adding class weights made the tree considerably more complex, growing it to a depth of 15.

Decision Tree - Pre-pruning with Max Depth 6¶

Limiting the max depth to 6 to make the tree simpler.

The default decision tree was built with a depth of 10, so pre-pruning to a max depth of 6 keeps a little more than 50% of that depth.
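As rough arithmetic on why capping the depth helps: a binary tree of depth d has at most 2**d leaves, so cutting the depth from 15 (the weighted tree above) down to 6 shrinks the worst-case tree size by orders of magnitude.

```python
# Worst-case leaf counts for a binary decision tree of a given depth.
for depth in (15, 10, 6):
    print(f"max_depth={depth:>2} -> at most {2 ** depth:>5} leaves")
```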

In [ ]:
# Build a decision tree with max_depth of 6 (a little more than 50% of the default tree's depth)
limitmodel = DecisionTreeClassifier(criterion="gini", max_depth=6, random_state=1)
limitmodel.fit(X_train, y_train)
Out[ ]:
DecisionTreeClassifier(max_depth=6, random_state=1)
In [ ]:
# Build Confusion matrix for train and test model
confusion_matrix_sklearn(limitmodel, X_train, y_train)
confusion_matrix_sklearn(limitmodel, X_test, y_test)
In [ ]:
# Print different metrics for the pre-pruned tree on training data
decision_tree_limit_perf_train = model_performance_classification_sklearn(limitmodel, X_train, y_train)
decision_tree_limit_perf_train
Out[ ]:
Accuracy Recall Precision F1
0 0.994857 0.94864 0.996825 0.972136
In [ ]:
# Print different metrics for the pre-pruned tree on test data
decision_tree_limit_perf_test = model_performance_classification_sklearn(limitmodel, X_test, y_test)
decision_tree_limit_perf_test
Out[ ]:
Accuracy Recall Precision F1
0 0.981333 0.872483 0.935252 0.902778

Visualizing the Decision Tree with max depth as 6¶

In [ ]:
# Check the depth of the tree
limitmodel.get_depth()
Out[ ]:
6
In [ ]:
# Plot the tree
plt.figure(figsize=(10, 10))
out = tree.plot_tree(
    limitmodel,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=False,
    class_names=None,
)
# below code will add arrows to the decision tree split if they are missing
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
In [ ]:
# Report showing the rules of a decision tree
print(tree.export_text(limitmodel, feature_names=feature_names, show_weights=True))
|--- Income <= 116.50
|   |--- CCAvg <= 2.95
|   |   |--- Income <= 106.50
|   |   |   |--- weights: [2553.00, 0.00] class: 0
|   |   |--- Income >  106.50
|   |   |   |--- Family_4 <= 0.50
|   |   |   |   |--- Education_3 <= 0.50
|   |   |   |   |   |--- Age <= 28.50
|   |   |   |   |   |   |--- weights: [5.00, 1.00] class: 0
|   |   |   |   |   |--- Age >  28.50
|   |   |   |   |   |   |--- weights: [43.00, 0.00] class: 0
|   |   |   |   |--- Education_3 >  0.50
|   |   |   |   |   |--- Mortgage <= 231.00
|   |   |   |   |   |   |--- weights: [15.00, 1.00] class: 0
|   |   |   |   |   |--- Mortgage >  231.00
|   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |--- Family_4 >  0.50
|   |   |   |   |--- Age <= 32.50
|   |   |   |   |   |--- CCAvg <= 2.40
|   |   |   |   |   |   |--- weights: [12.00, 0.00] class: 0
|   |   |   |   |   |--- CCAvg >  2.40
|   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |--- Age >  32.50
|   |   |   |   |   |--- Age <= 60.00
|   |   |   |   |   |   |--- weights: [0.00, 6.00] class: 1
|   |   |   |   |   |--- Age >  60.00
|   |   |   |   |   |   |--- weights: [4.00, 0.00] class: 0
|   |--- CCAvg >  2.95
|   |   |--- Income <= 92.50
|   |   |   |--- CD_Account_1 <= 0.50
|   |   |   |   |--- Age <= 26.50
|   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |--- Age >  26.50
|   |   |   |   |   |--- CCAvg <= 3.55
|   |   |   |   |   |   |--- weights: [40.00, 7.00] class: 0
|   |   |   |   |   |--- CCAvg >  3.55
|   |   |   |   |   |   |--- weights: [77.00, 2.00] class: 0
|   |   |   |--- CD_Account_1 >  0.50
|   |   |   |   |--- weights: [0.00, 5.00] class: 1
|   |   |--- Income >  92.50
|   |   |   |--- Education_3 <= 0.50
|   |   |   |   |--- Education_2 <= 0.50
|   |   |   |   |   |--- CD_Account_1 <= 0.50
|   |   |   |   |   |   |--- weights: [33.00, 4.00] class: 0
|   |   |   |   |   |--- CD_Account_1 >  0.50
|   |   |   |   |   |   |--- weights: [1.00, 5.00] class: 1
|   |   |   |   |--- Education_2 >  0.50
|   |   |   |   |   |--- Age <= 60.50
|   |   |   |   |   |   |--- weights: [0.00, 10.00] class: 1
|   |   |   |   |   |--- Age >  60.50
|   |   |   |   |   |   |--- weights: [4.00, 0.00] class: 0
|   |   |   |--- Education_3 >  0.50
|   |   |   |   |--- Mortgage <= 172.00
|   |   |   |   |   |--- CD_Account_1 <= 0.50
|   |   |   |   |   |   |--- weights: [0.00, 14.00] class: 1
|   |   |   |   |   |--- CD_Account_1 >  0.50
|   |   |   |   |   |   |--- weights: [2.00, 1.00] class: 0
|   |   |   |   |--- Mortgage >  172.00
|   |   |   |   |   |--- Income <= 100.00
|   |   |   |   |   |   |--- weights: [5.00, 1.00] class: 0
|   |   |   |   |   |--- Income >  100.00
|   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|--- Income >  116.50
|   |--- Education_3 <= 0.50
|   |   |--- Education_2 <= 0.50
|   |   |   |--- Family_3 <= 0.50
|   |   |   |   |--- Family_4 <= 0.50
|   |   |   |   |   |--- weights: [375.00, 0.00] class: 0
|   |   |   |   |--- Family_4 >  0.50
|   |   |   |   |   |--- weights: [0.00, 14.00] class: 1
|   |   |   |--- Family_3 >  0.50
|   |   |   |   |--- weights: [0.00, 33.00] class: 1
|   |   |--- Education_2 >  0.50
|   |   |   |--- weights: [0.00, 108.00] class: 1
|   |--- Education_3 >  0.50
|   |   |--- weights: [0.00, 114.00] class: 1

In [ ]:
# Importance of features in tree building. A feature's importance is the
# (normalized) total reduction of the criterion brought by that feature, also known as the Gini importance.

print(
    pd.DataFrame(
        limitmodel.feature_importances_, columns=["Imp"], index=X_train.columns
    ).sort_values(by="Imp", ascending=False)
)
                           Imp
Income                0.317812
Education_2           0.248480
Education_3           0.174878
Family_3              0.099498
Family_4              0.051526
CCAvg                 0.044703
CD_Account_1          0.027792
Age                   0.027472
Mortgage              0.007840
Securities_Account_1  0.000000
Online_1              0.000000
CreditCard_1          0.000000
Family_2              0.000000
In [ ]:
# Visualize the feature importances
importances = limitmodel.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()

Observations¶

In the decision tree with max depth 6, Income is the most important feature, followed by Education and Family size.

The tree is simpler, with a recall score of 94.86% on the training set and 87.24% on the test set.

Decision Tree - Pre-pruning using GridSearch¶

Using GridSearchCV for hyperparameter tuning of the tree model.

Grid search is a tuning technique that exhaustively evaluates combinations of hyperparameter values to find the best-performing one.
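The amount of work this implies can be enumerated by hand. A small sketch that mirrors the parameter grid in the cell below (using plain range in place of np.arange):

```python
from itertools import product

# Same grid as the GridSearchCV cell below, with range instead of np.arange.
parameters = {
    "max_depth": list(range(3, 10)),       # 7 values
    "min_samples_leaf": [1, 2, 5, 7, 10],  # 5 values
    "max_leaf_nodes": [2, 3, 5, 10],       # 4 values
}

# Every combination of the three lists is one candidate model.
combos = list(product(*parameters.values()))
cv_folds = 5
print(len(combos))             # 140 candidate settings (7 * 5 * 4)
print(len(combos) * cv_folds)  # 700 model fits with cv=5
```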

In [ ]:
# Choose the type of classifier.
premodel = DecisionTreeClassifier(criterion="gini", random_state=1)

# Grid of parameters to choose from
parameters = {
    "max_depth": np.arange(3, 10),
    "min_samples_leaf": [1, 2, 5, 7, 10],
    "max_leaf_nodes": [2, 3, 5, 10],
}

# Type of scoring used to compare parameter combinations (recall)
recall_scorer = make_scorer(recall_score)

# Run the grid search
grid_obj = GridSearchCV(premodel, parameters, scoring=recall_scorer, cv=5)
grid_obj = grid_obj.fit(X_train, y_train)

# Set the clf to the best combination of parameters
premodel = grid_obj.best_estimator_

# Fit the best algorithm to the data.
premodel.fit(X_train, y_train)
Out[ ]:
DecisionTreeClassifier(max_depth=5, max_leaf_nodes=10, min_samples_leaf=5,
                       random_state=1)
In [ ]:
# Build confusion matrices for the pre-pruned model on train and test data
confusion_matrix_sklearn(premodel, X_train, y_train)
confusion_matrix_sklearn(premodel, X_test, y_test)
In [ ]:
# Print different metrics for the pre-pruned tree on training data
decision_tree_pre_perf_train = model_performance_classification_sklearn(premodel, X_train, y_train)
decision_tree_pre_perf_train
Out[ ]:
Accuracy Recall Precision F1
0 0.986857 0.882175 0.976589 0.926984
In [ ]:
# Print different metrics for the pre-pruned tree on test data
decision_tree_pre_perf_test = model_performance_classification_sklearn(premodel, X_test, y_test)
decision_tree_pre_perf_test
Out[ ]:
Accuracy Recall Precision F1
0 0.974667 0.791946 0.944 0.861314

Visualizing the Pre Prune Tree¶

In [ ]:
# Check the depth of the tree
premodel.get_depth()
Out[ ]:
5
In [ ]:
# Plot the tree
plt.figure(figsize=(10, 10))
out = tree.plot_tree(
    premodel,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=False,
    class_names=None,
)
# below code will add arrows to the decision tree split if they are missing
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
In [ ]:
# Report showing the rules of a decision tree
print(tree.export_text(premodel, feature_names=feature_names, show_weights=True))
|--- Income <= 116.50
|   |--- CCAvg <= 2.95
|   |   |--- weights: [2632.00, 10.00] class: 0
|   |--- CCAvg >  2.95
|   |   |--- Income <= 92.50
|   |   |   |--- CD_Account_1 <= 0.50
|   |   |   |   |--- weights: [117.00, 10.00] class: 0
|   |   |   |--- CD_Account_1 >  0.50
|   |   |   |   |--- weights: [0.00, 5.00] class: 1
|   |   |--- Income >  92.50
|   |   |   |--- Education_3 <= 0.50
|   |   |   |   |--- weights: [38.00, 19.00] class: 0
|   |   |   |--- Education_3 >  0.50
|   |   |   |   |--- weights: [7.00, 18.00] class: 1
|--- Income >  116.50
|   |--- Education_3 <= 0.50
|   |   |--- Education_2 <= 0.50
|   |   |   |--- Family_3 <= 0.50
|   |   |   |   |--- Family_4 <= 0.50
|   |   |   |   |   |--- weights: [375.00, 0.00] class: 0
|   |   |   |   |--- Family_4 >  0.50
|   |   |   |   |   |--- weights: [0.00, 14.00] class: 1
|   |   |   |--- Family_3 >  0.50
|   |   |   |   |--- weights: [0.00, 33.00] class: 1
|   |   |--- Education_2 >  0.50
|   |   |   |--- weights: [0.00, 108.00] class: 1
|   |--- Education_3 >  0.50
|   |   |--- weights: [0.00, 114.00] class: 1

In [ ]:
# Importance of features in tree building. A feature's importance is the
# (normalized) total reduction of the criterion brought by that feature, also known as the Gini importance.

print(
    pd.DataFrame(
        premodel.feature_importances_, columns=["Imp"], index=X_train.columns
    ).sort_values(by="Imp", ascending=False)
)
                           Imp
Income                0.335478
Education_2           0.258373
Education_3           0.188599
Family_3              0.107563
Family_4              0.051352
CCAvg                 0.043100
CD_Account_1          0.015535
Age                   0.000000
Mortgage              0.000000
Securities_Account_1  0.000000
Online_1              0.000000
CreditCard_1          0.000000
Family_2              0.000000
In [ ]:
# Visualize the feature importances
importances = premodel.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()

Observations¶

When pre-pruning the decision tree using the grid-search parameters, Income is the most important feature, followed by Education and Family size.

It is a much simpler decision tree, with a max_depth of only 5.

The recall score is 88.21% on the training set and 79.19% on the test set.

Cost Complexity¶
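Minimal cost-complexity pruning scores each subtree by its error plus a penalty proportional to its leaf count, R_alpha(T) = R(T) + alpha * |leaves|; a subtree is pruned when collapsing it to a single leaf scores no worse. A minimal sketch of the criterion, with hypothetical error rates and leaf counts:

```python
# Cost-complexity score of a (sub)tree: training error plus a size penalty.
def cost_complexity(misclassification_rate, n_leaves, alpha):
    return misclassification_rate + alpha * n_leaves

# Hypothetical subtree vs. the single leaf that would replace it.
subtree = cost_complexity(0.02, 8, alpha=0.01)  # 0.02 + 0.08 = 0.10
leaf = cost_complexity(0.05, 1, alpha=0.01)     # 0.05 + 0.01 = 0.06
print(subtree > leaf)  # True -> at this alpha, prune the subtree to a leaf
```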

In [ ]:
# Creates a tree and return ccp_alpha with impurities
clf = DecisionTreeClassifier(random_state=1)
path = clf.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities
In [ ]:
pd.DataFrame(path).T
Out[ ]:
0 1 2 3 4 5 6 7 8 9 ... 16 17 18 19 20 21 22 23 24 25
ccp_alphas 0.0 0.000268 0.000268 0.000275 0.000278 0.000359 0.000381 0.000381 0.000476 0.000476 ... 0.000882 0.001552 0.001552 0.002333 0.003294 0.006473 0.007712 0.016154 0.032821 0.047088
impurities 0.0 0.000536 0.001609 0.002710 0.003824 0.004900 0.005280 0.005661 0.006138 0.006614 ... 0.016352 0.017903 0.022560 0.024893 0.028187 0.034659 0.042372 0.058525 0.124167 0.171255

2 rows × 26 columns

In [ ]:
# Plot the ccp_alpha against impurities
fig, ax = plt.subplots(figsize=(10,5))
ax.plot(ccp_alphas[:-1], impurities[:-1], marker='o', drawstyle="steps-post")
ax.set_xlabel("effective alpha")
ax.set_ylabel("total impurity of leaves")
ax.set_title("Total Impurity vs effective alpha for training set")
plt.show()

Next, train a decision tree for each of the effective alphas. The last value in ccp_alphas is the alpha that prunes the whole tree, leaving the final tree, clfs[-1], with a single node.

In [ ]:
clfs = []
for ccp_alpha in ccp_alphas:
    clf = DecisionTreeClassifier(random_state=1, ccp_alpha=ccp_alpha)
    clf.fit(X_train, y_train)
    clfs.append(clf)
print("Number of nodes in the last tree is: {} with ccp_alpha: {}".format(
      clfs[-1].tree_.node_count, ccp_alphas[-1]))
Number of nodes in the last tree is: 1 with ccp_alpha: 0.04708834100596768

For the remainder of the analysis, we remove the last element of clfs and ccp_alphas, because it is the trivial single-node tree. The plots below show that the number of nodes and the tree depth decrease as alpha increases.

In [ ]:
clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]

node_counts = [clf.tree_.node_count for clf in clfs]
depth = [clf.tree_.max_depth for clf in clfs]
fig, ax = plt.subplots(2, 1,figsize=(16,12))
ax[0].plot(ccp_alphas, node_counts, marker='o', drawstyle="steps-post")
ax[0].set_xlabel("alpha")
ax[0].set_ylabel("number of nodes")
ax[0].set_title("Number of nodes vs alpha")
ax[1].plot(ccp_alphas, depth, marker='o', drawstyle="steps-post")
ax[1].set_xlabel("alpha")
ax[1].set_ylabel("depth of tree")
ax[1].set_title("Depth vs alpha")
fig.tight_layout()
In [ ]:
# Recall vs alpha on train set
recall_train = []
for clf in clfs:
  pred_train = clf.predict(X_train)
  values_train = recall_score(y_train,pred_train)
  recall_train.append(values_train)
In [ ]:
# Recall vs alpha on test set
recall_test = []
for clf in clfs:
  pred_test = clf.predict(X_test)
  values_test = recall_score(y_test,pred_test)
  recall_test.append(values_test)
In [ ]:
fig, ax = plt.subplots(figsize=(15,5))
ax.set_xlabel("alpha")
ax.set_ylabel("Recall")
ax.set_title("Recall vs alpha for training and testing sets")
ax.plot(ccp_alphas, recall_train, marker='o', label="train",
        drawstyle="steps-post")
ax.plot(ccp_alphas, recall_test, marker='o', label="test",
        drawstyle="steps-post")
ax.legend()
plt.show()
In [ ]:
# Select the model with the highest recall on the test set
index_best_model = np.argmax(recall_test)
best_model = clfs[index_best_model]
print(best_model)
DecisionTreeClassifier(random_state=1)

Post Pruning¶

In [ ]:
# Build a Decision Tree using ccp_alpha above with class weights
postmodel = DecisionTreeClassifier(
    ccp_alpha=0.04708834100596768, class_weight={0: 0.10, 1: 0.90}, random_state=1
)
postmodel.fit(X_train, y_train)
Out[ ]:
DecisionTreeClassifier(ccp_alpha=0.04708834100596768,
                       class_weight={0: 0.1, 1: 0.9}, random_state=1)
In [ ]:
# Build Confusion matrix for train and test model
confusion_matrix_sklearn(postmodel, X_train, y_train)
confusion_matrix_sklearn(postmodel, X_test, y_test)
In [ ]:
# Print different metrics for the post-pruned decision tree on training data
decision_tree_post_perf_train = model_performance_classification_sklearn(
    postmodel, X_train, y_train
)
decision_tree_post_perf_train
Out[ ]:
Accuracy Recall Precision F1
0 0.819429 0.954683 0.338692 0.5
In [ ]:
# Print different metrics for the post-pruned decision tree on test data
decision_tree_post_perf_test = model_performance_classification_sklearn(
    postmodel, X_test, y_test
)
decision_tree_post_perf_test
Out[ ]:
Accuracy Recall Precision F1
0 0.8 0.932886 0.324009 0.480969

Visualizing the Post Prune tree¶

In [ ]:
# Plot the tree
plt.figure(figsize=(10, 10))
out = tree.plot_tree(
    postmodel,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=False,
    class_names=None,
)
# below code will add arrows to the decision tree split if they are missing
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
In [ ]:
# Report showing the rules of a decision tree
print(tree.export_text(postmodel, feature_names=feature_names, show_weights=True))
|--- Income <= 92.50
|   |--- weights: [255.20, 13.50] class: 0
|--- Income >  92.50
|   |--- weights: [61.70, 284.40] class: 1

In [ ]:
# Visualize the feature importances (only Income should be nonzero now)
importances = postmodel.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()

Observations¶

After post-pruning, Income is the only important feature in the model.

It is a very simple decision tree with a depth of only 1.

The recall score is 95.46% on the training set and 93.28% on the test set.

The model gives good, well-generalized results on both the training and test sets.
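The exported rules above collapse to a single threshold, so the whole post-pruned model can be written as one line of business logic (income is in thousands of dollars, per the data dictionary):

```python
# Depth-1 rule learned by the post-pruned tree: predict a likely loan buyer
# when annual income exceeds $92.5K (the "Income <= 92.50" split above).
def likely_loan_buyer(income_k):
    return 1 if income_k > 92.5 else 0

print(likely_loan_buyer(80), likely_loan_buyer(120))  # 0 1
```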

Model Performance Improvement¶

Model Comparison and Final Model Selection¶

In [ ]:
# training performance comparison

models_train_comp_df = pd.concat(
    [
        decision_tree_perf_train_without.T,
        decision_tree_perf_train.T,
        decision_tree_limit_perf_train.T,
        decision_tree_pre_perf_train.T,
        decision_tree_post_perf_train.T,
    ],
    axis=1,
)
models_train_comp_df.columns = [
    "Decision Tree without class_weight",
    "Decision Tree with class_weight",
    "Decision Tree (Pre-Pruning with Limit)",
    "Decision Tree (Pre-Pruning with GridSearch)",
    "Decision Tree (Post-Pruning)",
]

print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
Out[ ]:
Decision Tree without class_weight Decision Tree with class_weight Decision Tree (Pre-Pruning with Limit) Decision Tree (Pre-Pruning with GridSearch) Decision Tree (Post-Pruning)
Accuracy 1.0 1.0 0.994857 0.986857 0.819429
Recall 1.0 1.0 0.948640 0.882175 0.954683
Precision 1.0 1.0 0.996825 0.976589 0.338692
F1 1.0 1.0 0.972136 0.926984 0.500000
In [ ]:
# testing performance comparison

models_test_comp_df = pd.concat(
    [
        decision_tree_perf_test_without.T,
        decision_tree_perf_test.T,
        decision_tree_limit_perf_test.T,
        decision_tree_pre_perf_test.T,
        decision_tree_post_perf_test.T,
    ],
    axis=1,
)
models_test_comp_df.columns = [
    "Decision Tree without class_weight",
    "Decision Tree with class_weight",
    "Decision Tree (Pre-Pruning with Limit)",
    "Decision Tree (Pre-Pruning with GridSearch)",
    "Decision Tree (Post-Pruning)",
]

print("Testing performance comparison:")
models_test_comp_df
Testing performance comparison:
Out[ ]:
Decision Tree without class_weight Decision Tree with class_weight Decision Tree (Pre-Pruning with Limit) Decision Tree (Pre-Pruning with GridSearch) Decision Tree (Post-Pruning)
Accuracy 0.979333 0.978000 0.981333 0.974667 0.800000
Recall 0.899329 0.859060 0.872483 0.791946 0.932886
Precision 0.893333 0.914286 0.935252 0.944000 0.324009
F1 0.896321 0.885813 0.902778 0.861314 0.480969

Observations/ Conclusions¶

We analyzed the personal loan campaign data and built a predictive model using a decision tree classifier.

The model can be used to predict whether a customer will take a personal loan.

We visualized the different trees and their confusion matrices to better understand the models.

We did not remove outliers and still obtained a simple post-pruned decision tree.

Income, followed by Education and Family size, is the most important factor in predicting whether a customer will take a loan.

The post-pruned decision tree gives the highest recall: 95.46% on the training set and 93.28% on the test set.

Both the pre-pruned and post-pruned models reduce overfitting and generalize well.

The post-pruned tree is much simpler and easier to interpret.

Actionable Insights and Business Recommendations¶

What recommendations would you suggest to the bank?

The sales team's goal should be to minimize lost opportunity, i.e., the chance of predicting that a customer will not purchase a loan when in reality they would. This is achieved by minimizing false negatives, or equivalently maximizing the recall score.
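The relationship between false negatives and recall can be made concrete with a small sketch (the counts here are hypothetical, not read from the notebook's confusion matrices):

```python
# Recall = TP / (TP + FN): the share of actual loan buyers the model catches.
# Every false negative (a missed buyer) lowers it directly.
def recall(tp, fn):
    return tp / (tp + fn)

tp, fn = 316, 15  # hypothetical counts of caught vs. missed loan buyers
print(round(recall(tp, fn), 4))  # 0.9547
```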

We built decision tree models in four variants: the default tree, the default tree with class weights, and pre-pruned and post-pruned trees.

If accuracy is the priority, the pre-pruned decision tree is the way to go; if recall is the priority, the post-pruned decision tree should be used.

The marketing team should focus on the most important feature, Income. After Income, customers with graduate or advanced degrees and customers with larger families are the strongest indicators of the probability of a customer taking a personal loan.

Customers with high income tend to have high credit card spending, whereas low-income customers spend less.

The sales/marketing team should establish dedicated relationships with high-value customers.